perf(validate): SIMD UTF-8 validator + measurement infrastructure (draft for analysis) by membphis · Pull Request #50 · api7/lua-qjson

membphis · 2026-05-22T11:08:52Z

Status

Draft for performance analysis. Correctness is verified across all paths; the AVX2 SIMD validator's CJK speedup landed at ~14% rather than the 4× target in the spec. Handing off for deeper analysis before deciding the path forward.

Summary

This branch combines two logically related sub-projects (~14 commits, would normally be 2 stacked PRs):

Bench infrastructure (commits 2db82d7..151d0ea, 6 commits)

CJK chat-completion fixture (benches/fixtures/medium_resp_cjk.json, 60368 B byte-aligned with the existing ASCII fixture)
Rust criterion harness (benches/parse_eager.rs) — measures Document::parse_with_options end-to-end on the 4-fixture × 2-mode matrix
Makefile split (bench is now composite over bench-rust + bench-lua)
CI: cargo build --release --benches step to prevent bit-rot
README / docs / CLAUDE.md updates

SIMD UTF-8 validator (commits c0aaedd..7e77b51, 8 commits)

Cross-backend property test (tests/string_validate_crosscheck.rs, proptest 2000 cases) — scalar ≡ avx2 on arbitrary byte sequences
3 bad-UTF-8 reject fixtures + tests (truncated lead, overlong, surrogate)
2 new bench fixtures: mixed-script (37% high-bit) and emoji-heavy (80% high-bit)
AVX2 validator rewrite: 3-tier dispatch (ASCII fast-path / Lemire-Keiser `lookup4` / scalar fallback)
Two perf fixes during development:
1. Hoisted lookup4 table broadcasts + de-duplicated carry-prefix extraction
2. Fused `vpor + vpmovmskb` to recover ASCII fast-path performance (24% regression initially)

Bench results (Zen 2 AMD EPYC-Rome, 4 vCPU)

Numbers compared against the pre-SIMD baseline saved at commit `f7f07af`:

| Bench | Pre-SIMD | New | Δ | Spec target | Pass |
|---|---|---|---|---|---|
| `parse/ascii/eager` | 16.4 GiB/s | 16.4 GiB/s | −0.9% | ≤ 5% regression | ✅ |
| `parse/cjk/eager` | 1.07 GiB/s | 1.19 GiB/s | +14.3% | ≥ 4× (4.4 GiB/s) | ❌ |
| `parse/mixed/eager` | 1.15 GiB/s | 1.16 GiB/s | ~0% | ≥ 2× | ❌ |
| `parse/emoji/eager` | 1.16 GiB/s | 1.22 GiB/s | +6.3% | ≥ 2× | ❌ |
| 4× `*/lazy` | ~40 GiB/s | ~40 GiB/s | ±5% | ±3% | ⚠️ within noise |

ASCII passes. The CJK and emoji eager paths improve modestly (+14.3% and +6.3%) and mixed is flat; all three land far below the 4× / 2× / 2× targets in the spec.

What we know about the gap

Findings from a debugging investigation (see `tests/string_validate_crosscheck.proptest-regressions` for the captured regression seed that helped pin down the algorithm):

Cross-lane permute cost on Zen 2. `lookup4` uses three `_mm256_permute2x128_si256` per chunk for the prev1/prev2/prev3 shifts. Zen 2's cross-lane permute has ~3-cycle latency at 1-per-cycle throughput. A Tier-2 chunk costs roughly 26 SIMD ops in total.
Scalar fallback is faster than expected on uniform CJK content. Branch predictor handles the regular 3-byte sequence pattern well; effective ~1 cycle/byte. The "scalar baseline is slow" assumption that justifies SIMD validation isn't holding up on this specific workload + CPU.
The 24% ASCII regression (now fixed) was traced to the dispatch losing the compiler's mask-fusion optimization. Pre-SIMD code used the signed-cmpgt trick to detect ctrl + high-bit bytes in one `vpmovmskb`; the new 3-tier dispatch broke that by needing `high` separated from `ctrl|bs`. The fix in `7e77b51` restores the fusion via `vpor(cb_v, chunk_raw)` before the single movemask, with a second movemask only on the slow path.

What the analyst might look at

Is the lookup4 inner loop actually executing for CJK chunks, or is Tier 3 (scalar fallback) firing more than expected? — `perf stat` would tell us, but the bench is in a VM.
Cross-lane permute alternatives: can we avoid `_mm256_permute2x128_si256` and instead use `_mm_alignr_epi8` on two 128-bit halves? Within-lane ops are 1 cycle on Zen 2.
Different algorithm: `std::str::from_utf8`'s SWAR inner loop, hyperscan-style 2-state DFA SIMD, simdjson's "lookup3" variant, etc.
Is the bench's `Document::parse_with_options` measuring what we think? Maybe Phase 1 scanner is itself the bottleneck on CJK, not the validator.
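To make the permute question concrete, here is a minimal sketch of the prev1 shift-with-carry in three forms: a scalar reference, the cross-lane form (`_mm256_permute2x128_si256` followed by `_mm256_alignr_epi8`), and a lane-local form using `_mm_alignr_epi8` on two 128-bit halves. It is not taken from this branch. The function names are hypothetical and only the intrinsics are real `std::arch` calls. It shows that the two SIMD forms compute the same bytes; whether the lane-local form is actually faster on Zen 2 would still have to be measured.

```rust
/// Scalar reference: out[i] = cur[i - 1], with out[0] carried in from the
/// last byte of the previous chunk.
pub fn prev1_scalar(prev: &[u8; 32], cur: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    out[0] = prev[31];
    out[1..].copy_from_slice(&cur[..31]);
    out
}

/// Cross-lane form: one permute2x128 builds [prev.hi, cur.lo], then a
/// per-lane alignr shifts each lane right by 15 bytes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn prev1_cross_lane(prev: &[u8; 32], cur: &[u8; 32]) -> [u8; 32] {
    use std::arch::x86_64::*;
    let mut out = [0u8; 32];
    unsafe {
        let p = _mm256_loadu_si256(prev.as_ptr() as *const __m256i);
        let c = _mm256_loadu_si256(cur.as_ptr() as *const __m256i);
        let mixed = _mm256_permute2x128_si256::<0x21>(p, c);
        let shifted = _mm256_alignr_epi8::<15>(c, mixed);
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, shifted);
    }
    out
}

/// Lane-local form: treat the chunk as two 128-bit halves and use the
/// 128-bit alignr twice, with no cross-lane permute.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn prev1_lane_local(prev: &[u8; 32], cur: &[u8; 32]) -> [u8; 32] {
    use std::arch::x86_64::*;
    let mut out = [0u8; 32];
    unsafe {
        let p_hi = _mm_loadu_si128(prev.as_ptr().add(16) as *const __m128i);
        let c_lo = _mm_loadu_si128(cur.as_ptr() as *const __m128i);
        let c_hi = _mm_loadu_si128(cur.as_ptr().add(16) as *const __m128i);
        let lo = _mm_alignr_epi8::<15>(c_lo, p_hi);
        let hi = _mm_alignr_epi8::<15>(c_hi, c_lo);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, lo);
        _mm_storeu_si128(out.as_mut_ptr().add(16) as *mut __m128i, hi);
    }
    out
}

/// True when every backend available on this CPU matches the scalar reference.
pub fn prev1_backends_agree(prev: &[u8; 32], cur: &[u8; 32]) -> bool {
    let want = prev1_scalar(prev, cur);
    #[allow(unused_mut)]
    let mut ok = want[0] == prev[31] && want[1..] == cur[..31];
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            ok &= unsafe { prev1_cross_lane(prev, cur) } == want;
        }
        if is_x86_feature_detected!("ssse3") {
            ok &= unsafe { prev1_lane_local(prev, cur) } == want;
        }
    }
    ok
}
```

The prev2 and prev3 shifts follow the same shape with alignr immediates of 14 and 13.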

Correctness verification

Cross-check property test: 2000 proptest cases pass (scalar ≡ avx2 on arbitrary byte sequences)
Bad-UTF-8 reject fixtures: 3/3 pass (truncated, overlong, surrogate)
Boundary tests: 2/2 pass (3-byte lead at chunk boundary cases)
Full `cargo test --release`: 297 tests across 18 binaries, all pass
Scalar-only build (`--no-default-features`): all pass
`cargo clippy --release --all-targets -- -D warnings`: clean

Test plan

CI matrix: cargo test (default + scalar-only + test-panic), cargo build --benches, Lua busted, LuaRocks package validation
Re-run cross-check at 20K cases (`PROPTEST_CASES=20000 cargo test --release --test string_validate_crosscheck`) before merging out of draft
Reproduce bench numbers on a non-virtualized host with `perf` available
Decide based on analysis: ship as-is with revised expectations / revert SIMD validator and keep only bench infra / try alternative algorithm

🤖 Generated with Claude Code

Mirrors medium_resp.json byte-for-byte (60368 B) but replaces the content field with 15000 × "中 " repetitions. The repeated 3-byte BMP CJK character forces the AVX2 string validator off its ASCII fast path on every chunk, exposing the scalar fallback cost that the upcoming SIMD UTF-8 validator targets.

Measures Document::parse_with_options end-to-end across ASCII and CJK fixtures in both EAGER and LAZY mode (4 benches total). Throughput is reported in MB/s. The eager-vs-lazy delta per fixture is the value-level validation cost that future SIMD optimizations target; the ASCII benches serve as a regression guard.
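As a rough illustration of what a throughput number like this means, the sketch below times a closure over a byte buffer with `std::time::Instant` and reports MB/s. It is a deliberately simplified stand-in, not the criterion harness in `benches/parse_eager.rs`: there is no warm-up, outlier handling, or statistics, and `throughput_mb_s` is a hypothetical helper name. `std::str::from_utf8` stands in for the real parse call in the usage example.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Runs `work` over `data` `iters` times and returns decimal megabytes
/// processed per second of wall-clock time.
pub fn throughput_mb_s<F: FnMut(&[u8])>(data: &[u8], iters: u32, mut work: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from hoisting or deleting the call.
        work(black_box(data));
    }
    // Clamp to avoid dividing by zero on very coarse clocks.
    let secs = start.elapsed().as_secs_f64().max(1e-9);
    (data.len() as f64 * iters as f64) / secs / 1e6
}
```

A call such as `throughput_mb_s(&fixture, 100, |d| { black_box(std::str::from_utf8(d).is_ok()); })` gives a ballpark figure only; the criterion numbers remain the reference.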

make bench now runs both Rust criterion and Lua-vs-cjson, matching the composition pattern used by make test. make bench-rust is the inner-loop target for SIMD tuning; make bench-lua preserves the existing user-facing comparison harness behavior unchanged.

make bench is now the composite suite; make bench-lua preserves the prior Lua-vs-cjson behavior, which is what the benchmarks page documents. make bench-rust is the new Rust criterion entry point.

cargo build --release --benches catches stale bench source on every PR without paying the cost (or accepting the non-determinism) of actually running benchmarks in CI.

Reflects make bench composition + the bench-rust / bench-lua sub-targets, and lists the new medium_resp_cjk.json fixture under benches/. Caught by the final review on the bench-infrastructure PR.

proptest property: validate_span_scalar and validate_span_avx2 must return byte-identical Result for any byte sequence (2000 cases per CI run). Mirrors tests/scanner_crosscheck.rs pattern. Passes on current code (AVX2 falls back to scalar on non-ASCII, so trivially identical); will catch any divergence introduced by the upcoming SIMD UTF-8 validator rewrite.
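The cross-check idea can be sketched without any dependencies. The example below is an assumption-laden stand-in for the real test: it swaps proptest for a small xorshift generator, and it compares a hand-written scalar validator against `std::str::from_utf8` rather than comparing `validate_span_scalar` with `validate_span_avx2`. The names `validate_utf8_scalar` and `find_divergence` are hypothetical.

```rust
/// Hand-written scalar validator. Returns Err(offset) of the first byte
/// that starts an invalid or truncated sequence.
pub fn validate_utf8_scalar(bytes: &[u8]) -> Result<(), usize> {
    let mut i = 0;
    while i < bytes.len() {
        // (sequence length, allowed range of the second byte)
        let (len, lo, hi): (usize, u8, u8) = match bytes[i] {
            0x00..=0x7F => { i += 1; continue; }
            0xC2..=0xDF => (2, 0x80, 0xBF),
            0xE0 => (3, 0xA0, 0xBF),                 // rejects overlong 3-byte forms
            0xE1..=0xEC | 0xEE..=0xEF => (3, 0x80, 0xBF),
            0xED => (3, 0x80, 0x9F),                 // rejects UTF-16 surrogates
            0xF0 => (4, 0x90, 0xBF),                 // rejects overlong 4-byte forms
            0xF1..=0xF3 => (4, 0x80, 0xBF),
            0xF4 => (4, 0x80, 0x8F),                 // rejects > U+10FFFF
            _ => return Err(i),                      // 0x80..=0xC1, 0xF5..=0xFF
        };
        if i + len > bytes.len() { return Err(i); }
        if bytes[i + 1] < lo || bytes[i + 1] > hi { return Err(i); }
        for k in 2..len {
            if bytes[i + k] & 0xC0 != 0x80 { return Err(i); }
        }
        i += len;
    }
    Ok(())
}

/// Generates `cases` pseudo-random inputs and returns the first one on which
/// the two validators disagree, or None if they always agree.
pub fn find_divergence(cases: usize, seed: u64) -> Option<Vec<u8>> {
    // Bytes that sit on UTF-8 range boundaries, to hit multi-byte paths often.
    const POOL: [u8; 8] = [0x41, 0x80, 0xBF, 0xC2, 0xE0, 0xED, 0xF0, 0xF4];
    let mut state = seed | 1;
    let mut next = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };
    for _ in 0..cases {
        let len = (next() % 64) as usize;
        let input: Vec<u8> = (0..len)
            .map(|_| {
                let r = next();
                if r & 1 == 0 { POOL[(r >> 8) as usize % POOL.len()] } else { (r >> 24) as u8 }
            })
            .collect();
        let ours = validate_utf8_scalar(&input).err();
        let oracle = std::str::from_utf8(&input).err().map(|e| e.valid_up_to());
        if ours != oracle {
            return Some(input);
        }
    }
    None
}
```

Comparing the full result (here the error offset, not just pass/fail) is what makes this kind of test sensitive to subtle table or carry bugs.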

Three point tests for the three UTF-8 error classes the validator must reject: truncated multi-byte sequence (0xC3 with no continuation), overlong encoding (0xC0 0x80 = U+0000 in 2 bytes), and UTF-16 surrogate (0xED 0xA0 0x80 = U+D800). Tests pass on the current scalar fallback; they guard against regressions in the upcoming SIMD validator rewrite.
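The byte arithmetic behind the overlong and surrogate cases can be checked directly. The sketch below decodes the payload bits the way a naive decoder would (no validity checks) and uses `std::str::from_utf8` as a stand-in oracle for rejection. The helper names are hypothetical and are not the crate's test code.

```rust
/// Naive 2-byte decode: 5 payload bits from the lead, 6 from the continuation.
pub fn naive_decode2(b0: u8, b1: u8) -> u32 {
    ((b0 as u32 & 0x1F) << 6) | (b1 as u32 & 0x3F)
}

/// Naive 3-byte decode: 4 payload bits from the lead, 6 from each continuation.
pub fn naive_decode3(b0: u8, b1: u8, b2: u8) -> u32 {
    ((b0 as u32 & 0x0F) << 12) | ((b1 as u32 & 0x3F) << 6) | (b2 as u32 & 0x3F)
}

/// Offset of the first invalid sequence according to the standard library,
/// or None when the input is valid UTF-8.
pub fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}

pub const TRUNCATED: &[u8] = &[0xC3];            // lead byte with no continuation
pub const OVERLONG: &[u8] = &[0xC0, 0x80];       // U+0000 spelled in 2 bytes
pub const SURROGATE: &[u8] = &[0xED, 0xA0, 0x80]; // U+D800
```

`naive_decode2(0xC0, 0x80)` yields 0, which already fits in one byte, and `naive_decode3(0xED, 0xA0, 0x80)` yields 0xD800, which is reserved for UTF-16 surrogates; a conforming validator has to reject both even though the bit layout looks well-formed.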

medium_resp_mixed.json: 60368 B, content cycles "中 hello é world 😀 " (24 B/cycle × 2500 = 60000 B), exercising 1/2/3/4-byte UTF-8 sequences with frequent script transitions. medium_resp_emoji.json: 60368 B, content is "😀 " × 12000 (5 B/cycle), exercising the 4-byte UTF-8 lookup4 path under maximum pressure (80% high-bit ratio). Same skeleton and total byte count as medium_resp{,_cjk}.json so the bench's MB/s numbers are directly comparable across all four fixtures.
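The cycle sizes and high-bit ratios quoted for the fixtures can be reproduced with a few lines. This is a quick arithmetic check, not the fixture generator; `content_bytes` and `high_bit_ratio` are hypothetical helper names, and the cycles are written with escapes so that é is unambiguously the precomposed U+00E9.

```rust
/// Total content bytes for `reps` repetitions of `cycle`.
pub fn content_bytes(cycle: &str, reps: usize) -> usize {
    cycle.len() * reps
}

/// Fraction of bytes in `s` with the high bit set (non-ASCII UTF-8 bytes).
pub fn high_bit_ratio(s: &str) -> f64 {
    let high = s.bytes().filter(|b| *b >= 0x80).count();
    high as f64 / s.len() as f64
}

pub const CJK_CYCLE: &str = "\u{4E2D} ";                               // "中 "
pub const MIXED_CYCLE: &str = "\u{4E2D} hello \u{E9} world \u{1F600} "; // "中 hello é world 😀 "
pub const EMOJI_CYCLE: &str = "\u{1F600} ";                            // "😀 "
```

All three cycles come out at 60000 content bytes (4 B × 15000, 24 B × 2500, 5 B × 12000), with high-bit ratios of 0.75, 0.375 and 0.8. Given the 60368 B file size, that leaves 368 B for the shared JSON skeleton.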

Adds parse/mixed/{eager,lazy} and parse/emoji/{eager,lazy} entries. Mixed exercises script-transition cost in the validator; emoji exercises 4-byte UTF-8 pressure. Lazy benches confirm scanner path is content-agnostic (all four should land within ±3% of each other). Numbers from this commit form the pre-simd baseline against which the upcoming AVX2 validator rewrite will be measured.

Replaces the AVX2 fast-path-and-fallback layer with a three-tier dispatch:

- Tier 1: pure printable ASCII chunks skip wholesale (unchanged).
- Tier 2: pure UTF-8 chunks (no control/backslash) run Lemire/Keiser lookup4 (5 SIMD ops/chunk, no scalar fallback).
- Tier 3: chunks with a control byte or backslash flush the lookup4 carry, then hand off to the scalar state machine.

Carry state across chunks: prev_input (256-bit) feeds shift-with-carry into lookup4's prev1/prev2/prev3 inputs. err_acc accumulates per-chunk errors; it is checked at the chunk boundary before Tier 3 handoff and at the end of the main loop. The prev_ended_ascii flag is the safety interlock for Tier 1.

Lookup tables transcribed verbatim from simdjson's utf8_lookup4_algorithm.h (Lemire & Keiser, 2020).

Correctness verified by tests/string_validate_crosscheck.rs (2000 proptest cases vs scalar oracle) and three explicit bad-UTF-8 reject fixtures.

proptest captured the failing seed [0x00 × 30, 0x80, 0x00] that surfaced during the AVX2 lookup4 development — initial transcription of simdjson's tables had errors that this input revealed within seconds of running the cross-check. Committing the seed ensures the exact bug-pattern is replayed on every future run.

Three code-quality fixes from review of the AVX2 lookup4 commit:

- lookup4_chunk no longer re-broadcasts BYTE1_HIGH / BYTE1_LOW / BYTE2_HIGH on every call; the three table vectors are computed once at the top of validate_span_avx2_impl and passed in.
- prev_ended_ascii's safety role at the Tier 1 skip is now documented inline (was only in the module-level doc).
- The 1-3 byte carry-prefix extraction from prev_input was duplicated in the Tier 3 fallback and the post-loop tail; extracted into a private extract_carry_prefix helper so both sites share one source of truth.

No behavior change; cross-check property test still passes 2000 cases.

The three-tier dispatch in 47b8fb5 broke the compiler's ability to collapse the high/ctrl/bs masks into a single vpmovmskb, regressing parse/ascii/eager by ~24% on Zen 2 (port-0-only vpmovmskb at 1/cycle was the bottleneck).

Fix: in the inner loop, build a 'cb_v = ctrl|bs' vector first, then detect any interesting byte (ctrl|bs|high) via vpor(cb_v, chunk_raw) followed by a single vpmovmskb. Only on the slow path do we compute a second movemask on cb_v to disambiguate Tier 2 (pure UTF-8) from Tier 3 (control byte or backslash).

ASCII inner loop: back to 1 vpmovmskb per chunk. CJK / mixed / emoji paths unchanged (they take the slow path on every chunk, but the single extra movemask there is dwarfed by lookup4's cost). Cross-check property test still passes 2000 cases against scalar.
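The fused dispatch described in this commit message can be sketched as follows. This is a reconstruction from the prose, not the code in the commit: `Tier`, `classify`, `classify_scalar` and `classify_avx2` are hypothetical names, and the unsigned `min`-then-compare used to build the control-byte mask is one possible choice. The point it illustrates is that OR-ing `cb_v` with the raw chunk puts every interesting byte's flag in the sign bit, so one movemask covers ctrl, backslash and high-bit bytes at once.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Tier {
    AsciiSkip,      // Tier 1: nothing interesting in the chunk
    Utf8Simd,       // Tier 2: high-bit bytes only
    ScalarFallback, // Tier 3: control byte or backslash present
}

/// Portable reference for the same classification.
pub fn classify_scalar(chunk: &[u8; 32]) -> Tier {
    let mut any_high = false;
    for &b in chunk {
        if b < 0x20 || b == b'\\' {
            return Tier::ScalarFallback;
        }
        any_high |= b >= 0x80;
    }
    if any_high { Tier::Utf8Simd } else { Tier::AsciiSkip }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn classify_avx2(chunk: &[u8; 32]) -> Tier {
    use std::arch::x86_64::*;
    unsafe {
        let raw = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        // ctrl: unsigned byte <= 0x1F, expressed as min(raw, 0x1F) == raw.
        let ctrl = _mm256_cmpeq_epi8(_mm256_min_epu8(raw, _mm256_set1_epi8(0x1F)), raw);
        let bs = _mm256_cmpeq_epi8(raw, _mm256_set1_epi8(b'\\' as i8));
        let cb_v = _mm256_or_si256(ctrl, bs);
        // Fast path: a single movemask over (cb_v | raw) sees ctrl, bs and high-bit bytes.
        if _mm256_movemask_epi8(_mm256_or_si256(cb_v, raw)) == 0 {
            return Tier::AsciiSkip;
        }
        // Slow path only: a second movemask separates Tier 3 from Tier 2.
        if _mm256_movemask_epi8(cb_v) != 0 { Tier::ScalarFallback } else { Tier::Utf8Simd }
    }
}

/// Uses the AVX2 path when the CPU supports it, otherwise the scalar reference.
pub fn classify(chunk: &[u8; 32]) -> Tier {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { classify_avx2(chunk) };
        }
    }
    classify_scalar(chunk)
}
```

In a real inner loop the two broadcast constants would be hoisted out of the per-chunk call, in the same spirit as the table-broadcast hoisting mentioned earlier in this PR.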

coderabbitai · 2026-05-22T11:08:59Z

Important

Review skipped

Draft detected.


membphis · 2026-05-22T11:23:57Z

Performance analysis summary: the 2-4× CJK/mixed/emoji targets are unattainable with SIMD UTF-8 validation alone.

Root cause

LAZY mode (~40 GiB/s on all fixtures) is the hard upper bound for EAGER. The structural scan + depth check alone costs ~1.5 μs for 60 KB. EAGER mode spends ~54.5 μs in validate_eager_values, but UTF-8 validation is only a fraction of that — the grammar state machine walking indices and memory access are fixed costs that no validator can eliminate.

On uniform CJK, the scalar validator already runs at ~1 cycle/byte. The CPU achieves this via near-perfect branch prediction (the content is the same 3-byte character followed by a space, repeated, so the same code path runs on every iteration) and OOO execution keeping multiple iterations in flight. SIMD lookup4 does ~26 ops per 32 bytes, which we estimate at ≈ 1.6 cycles/byte on Zen 2 after accounting for the three cross-lane permutes (_mm256_permute2x128_si256, 3-cycle latency each). The SIMD path executes more instructions per byte than scalar on this workload, so it cannot realistically exceed 1.3-1.5×.
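The shape of this argument is an Amdahl-style bound, which a short calculation makes explicit. The sketch below converts the quoted throughputs into time for the 60368 B fixture and computes the overall speedup for a given validator speedup. The validation share of eager time is not measured in this PR, so any `fraction` passed in is a hypothetical input, not a finding.

```rust
const GIB: f64 = 1024.0 * 1024.0 * 1024.0;

/// Microseconds needed to process `bytes` at `gib_per_s` GiB/s.
pub fn micros_at(bytes: usize, gib_per_s: f64) -> f64 {
    bytes as f64 / (gib_per_s * GIB) * 1e6
}

/// Amdahl's law: overall speedup when a `fraction` of the runtime
/// (0.0..=1.0) is made `speedup` times faster and the rest is unchanged.
pub fn amdahl(fraction: f64, speedup: f64) -> f64 {
    1.0 / ((1.0 - fraction) + fraction / speedup)
}
```

`micros_at(60368, 40.0)` is about 1.4 μs and `micros_at(60368, 1.07)` is about 52.5 μs, in line with the ~1.5 μs and ~54.5 μs figures above. As an illustration only: if validation were half of eager time, even an infinitely fast validator would cap the end-to-end gain at 2×, and a 4× validator would give `amdahl(0.5, 4.0)` = 1.6×.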

What could still be squeezed out

| Optimization | Est. gain | Notes |
|---|---|---|
| Lane-local 128-bit ops instead of permute2x128 | +10-15% | `_mm_alignr_epi8` is 1 cycle vs 3 |
| CJK-uniform fast path (skip lookup4) | +5-10% | Detect all-lead+continuation pattern |
| Total ceiling | +35-45% | Including the 14% already achieved |

The 4× spec target was based on an assumption that scalar is "slow" — but on uniform non-ASCII data, scalar is actually very close to optimal due to branch prediction + OOO. There is no architectural path to 2-4×.

Recommendation

Keep the bench infrastructure commits (valuable); revert or shelve the SIMD validator commits. If CJK parse throughput is a priority, the real leverage is LAZY mode: at ~40 GiB/s it is roughly 40× the ~1 GiB/s eager path, because it validates on access rather than eagerly.
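For readers unfamiliar with the idea, "validate on access" can be illustrated with a small wrapper that defers the UTF-8 check until a value is first read and then caches the outcome. This is a conceptual sketch only: `LazyStr` is a hypothetical type, not qjson's API, and it uses `std::str::from_utf8` in place of the crate's own validator.

```rust
use std::cell::OnceCell;

/// A byte span whose UTF-8 validity is checked on first access, not at parse time.
pub struct LazyStr<'a> {
    raw: &'a [u8],
    // Cached result: the validated &str, or the offset of the first bad byte.
    checked: OnceCell<Result<&'a str, usize>>,
}

impl<'a> LazyStr<'a> {
    /// Construction is O(1); no bytes are inspected here.
    pub fn new(raw: &'a [u8]) -> Self {
        Self { raw, checked: OnceCell::new() }
    }

    /// Validates on the first call and returns the cached result afterwards.
    pub fn get(&self) -> Result<&'a str, usize> {
        *self
            .checked
            .get_or_init(|| std::str::from_utf8(self.raw).map_err(|e| e.valid_up_to()))
    }

    /// True once `get` has run at least once.
    pub fn is_validated(&self) -> bool {
        self.checked.get().is_some()
    }
}
```

The trade-off is that invalid input is reported at access time instead of parse time, and values that are never read are never validated.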

membphis · 2026-05-22T11:24:03Z

Closing — performance ceiling fully characterized. See analysis in the comment above.

membphis added 14 commits May 22, 2026 07:55

docs: update make bench references for split Rust/Lua targets

aa6ba31


ci: compile benches in rust job to prevent bit-rot

039a2b5


docs: refresh CLAUDE.md for split bench targets

151d0ea


membphis closed this May 22, 2026

membphis deleted the worktree-perf-amd-zen2 branch May 22, 2026 15:30

membphis wants to merge 14 commits into main from worktree-perf-amd-zen2
